Text and Models

Digital Humanities is often concerned with creating models of text: a general name for a kind of representation of text which makes it in some way easier to interpret. TEI-encoded text is an example of a model: we take the raw material of a text document and add elements to it to make it easier to work with and analyse. Models are often further abstracted from the original text. One way we can represent text in a way that a machine can interpret is with a word vector. A word vector is simply a numerical representation of a word within a corpus (a body of text, often a series of documents), usually consisting of a series of numbers in a specified sequence. This type of representation is used for a variety of Natural Language Processing tasks - for instance measuring the similarity between two documents. This post uses a couple of R packages and a method for creating word vectors with a neural net, called GloVe, to produce a series of vectors which give useful clues as to the semantic links between words in a corpus. The method is then used to analyse the printed summaries of the English State Papers, from State Papers Online, and show how they can be used to understand how the association between words and concepts changed over the course of the seventeenth century.

What is a Word Vector, Then?

Imagine you have two documents in a corpus. One of them is an article about pets, and the other is a piece of fiction about a team of crime fighting animal superheroes. We’ll call them document A and document B. One way to represent the words within these documents as a vector would be to use the counts of each word per document.

To do this, you could give each word a set of coordinates, \(x\) and \(y\), where \(x\) is a count of how many times the word appears in document A and \(y\) the number of times it appears in document B.

The first step is to make a dataframe with the relevant counts:

library(ggrepel)
library(tidyverse)
word_vectors = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
      x = c(10, 8, 6, 5, 6, 5, 2, 1),
      y = c(0, 1, 3, 5, 8, 8, 10, 9))

word_vectors

This data can be represented as a two-dimensional plot where each word is placed on the x and y axes based on their x and y values, like this:

ggplot() + 
  geom_point(data = word_vectors, aes(x, y), size =4, alpha = .7) + 
  geom_text_repel(data = word_vectors, aes(x, y, label = word)) + 
  theme_bw() + 
  labs(title = "Words Represented in Two-dimension Space") + 
  theme(title = element_text(face = 'bold')) + 
  scale_x_continuous(breaks = 1:10) + 
  scale_y_continuous(breaks = 1:10)

Each word is represented as a vector of length 2: ‘rabbit’ is a vector containing two numbers: {5,8}, for example. Using very basic maths we can calculate the euclidean distance between any pair of words. More or less the only thing I can remember from secondary school math is how to calculate the distance between two points on a graph, using the following formula:

\[ \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2 } \]

Where \(x\) is the first point and \(y\) the second. This can easily be turned into a function in R, which takes a set of coordinates (the arguments x1 and x2) and returns the euclidean distance:

euc.dist <- function(x1, x2) sqrt(sum((pointA - pointB) ^ 2))

To get the distance between crufts and mouse, set pointA as the \(x\) and \(y\) ccoordinates for the first entry in the dataframe of coordinates we created above, and pointB the coordinates for the fifth entry:

pointA = c(word_vectors$x[1], word_vectors$y[1])
pointB = c(word_vectors$x[5], word_vectors$y[5])

euc.dist(pointA, pointB)
## [1] 8.944272

Representing a pair of words as vectors and measuring the distance between them is commonly used to suggest a semantic link between the two. For instance, the distance between hero and cape in this corpus is small, because they have similar properties: they both occur mostly in the document about superheroes and rarely in the document about pets.

pointA = c(word_vectors$x[word_vectors$word == 'hero'], word_vectors$y[word_vectors$word == 'hero'])

pointB = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])

euc.dist(pointA, pointB)
## [1] 1.414214

This suggests that the model has ‘learned’ that in this corpus, hero and cape are semantically more closely linked than other pairs in the dataset. The difference between cape and feed, on the other hand, is large, because one appears often in the superheroes article and rarely in the other, and vice versa.

pointA = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])

pointB = c(word_vectors$x[word_vectors$word == 'feed'], word_vectors$y[word_vectors$word == 'feed'])

euc.dist(pointA, pointB)
## [1] 10.81665

Multi-Dimensional Vectors

These vectors, each consisting of two numbers, can be thought of as two-dimensional vectors: a type which can be represented on a 2D scatterplot as \(x\) and \(y\). It’s very easy to add a third dimension, \(z\):

word_vectors_3d = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
      x = c(10, 8, 6, 5, 6, 5, 2, 1),
      y = c(0, 1, 3, 5, 8, 8, 10, 9),
      z = c(1,3,5,2,7,8,4,3))

Just like the plot above, we can plot the words, this time in in three dimensions, using Plotly:

library(plotly)

plot_ly(data = word_vectors_3d, x =  ~x, y = ~y,z =  ~z, text = ~word) %>% add_markers()

You can start to understand how the words now cluster together in the 3D plot: rabbit and mouse are clustered together, but now in the third dimension they are further away from dog. We can use the same formula as above to calculate these distances, just by adding the z coordinates to the pointA and pointB vectors:

pointA = c(word_vectors$x[word_vectors$word == 'dog'], word_vectors$y[word_vectors$word == 'dog'], word_vectors$z[word_vectors$word == 'dog'])
## Warning: Unknown or uninitialised column: `z`.
pointB = c(word_vectors$x[word_vectors$word == 'mouse'], word_vectors$y[word_vectors$word == 'mouse'], word_vectors$z[word_vectors$word == 'mouse'])
## Warning: Unknown or uninitialised column: `z`.
euc.dist(pointA, pointB)
## [1] 3.162278

The nice thing about the method is that while my brain starts to hurt when I think about more than three dimensions, the maths behind it doesn’t care: you can just keep plugging in longer and longer vectors and it’ll continue to calculate the distances as long as they are the same length. This means you can use this same formula not just when you have x and y coordinates, but also z, a, b, c, d, and so on for as long as you like. This is often called ‘representing words in multi-dimensional euclidean space’, or something similar which sounds great on grant applications but it’s really just doing some plotting and measuring distances. Which means that if you represent all the words in a corpus as a long vector (series of coordinates), you can quickly measure the distance between any two.

In a large corpus with a properly-constructed vector representation, the semantic relationships between the words start to make a lot of sense. What’s more, because of vector math, you can add, subtract, divide and multiply the words together to get new vectors, and then find the closest to that. Here, we create a new vector, which is pointA - pointB (dog - mouse). Then loop through each vector and calculate the distance, and display in a new dataframe:

pointC = pointA - pointB

df_for_results = tibble()
for(i in 1:8){
  
  pointA = c(word_vectors$x[i], word_vectors$y[i], word_vectors$z[i])
  u = tibble(dist = euc.dist(pointC, pointA), word = word_vectors$word[i])
  df_for_results = rbind(df_for_results, u)
}
## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.

## Warning: Unknown or uninitialised column: `z`.
df_for_results %>% arrange(dist)

The closest to dog - mouse is hero, with this vector representation.

From Vectors to Word Embeddings

These vectors are also known as word embeddings. Real algorithms base the vectors on more sophisticated metrics than that I used above. Some, such as GloVe record co-occurrence probabilities (the likelihood of every pair of words in a corpus to co-occur within a set ‘window’ of words either side), using a neural network, and pre-trained over enormous corpora of text. The resulting vectors are often used to represent the relationships between modern meanings of words, to track semantic changes over time, or to understand the history of concepts, though it’s worth pointing out they’re only as representative as the corpus used (many use sources such as Wikipedia, or Reddit, mostly produced by white men and so there’s a danger of biases towards those groups).

Word embeddings are often critiqued as reflecting or propogating bias (I highly recommend Kaspar Beelen’s post and tools to understand more about this) of their source texts. The source used here is a corpus consisting of the printed summaries of the Calendars of State Papers, which I’ve described in detail here. As such it is likely highly biased, but if the purpose of an analysis is historical, for example to understand how a concept was represented at a given time, by a specific group, in a particular body of text, the biases captured by word embeddings can be seen as a research strength rather than a weakness. The data is in no way representative of early modern text more generally, and, what’s more, the summaries were written in the 19th century and so will reflect what editors at the time thought was important. In these two ways, the corpus will reproduce a very particular wordview of a very specific group, at a very specific time. Because of this, can use the embeddings to get an idea of how certain words or ideas were semantically linked, specifically in the corpus of calendar abstracts. The data will not show us how early modern concepts were related, but it might show conceptual changes in words within the information apparatus of the state.

The following instructions are adapted from the project vignette and this tutorial. First, tokenise all the abstract text and remove very common words called stop words:

library(text2vec)
library(tidytext)
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
data("stop_words")

Next, load and pre-process the abstract text:

spo_raw = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/fromto_all_place_mapped_stuart_sorted', delim = '\t', col_names = F )
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   X2 = col_double(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_character(),
##   X8 = col_character()
## )
spo_mapped_people = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/people_docs_stuart_200421', delim = '\t', col_names = F)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_logical()
## )
load('/Users/yannryanpersonal/Documents/blog_posts/g')
g = g %>% group_by(path) %>% summarise(value = paste0(value, collapse = "<br>"))
## `summarise()` ungrouping output (override with `.groups` argument)
spo_raw = spo_raw %>%
mutate(X7 = str_replace(X7, "spo", "SPO")) %>%
separate(X7, into = c('Y1', 'Y2', 'Y3'), sep = '/') %>%
mutate(fullpath = paste0("/Users/Yann/Documents/non-Github/spo_xml/", Y1, '/XML/', Y2,"/", Y3)) %>% mutate(uniquecode = paste0("Z", 1:nrow(spo_raw), "Z"))

withtext = left_join(spo_raw, g, by = c('fullpath' = 'path')) %>%
left_join(spo_mapped_people %>% dplyr::select(X1, from_name = X2), by = c('X1' = 'X1'))%>%
left_join(spo_mapped_people %>% dplyr::select(X1, to_name = X2), by = c('X2' = 'X1')) 

Tokenize the text using the Tidytext function unnest_tokens(), remove stop words, lemmatize the text (reduce the words to their stem) using textstem, and remove numbers. This creates a new dataset, with one row per word, plus.

words = withtext %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% anti_join(stop_words)%>% 
  mutate(word = lemmatize_words(word)) %>% 
  filter(!str_detect(word, "[0-9]{1,}")) %>% mutate(word = str_remove(word, "\\'s"))
## Joining, by = "word"

Create a ‘vocabulary’, which is just a list of each word found in the dataset and the times they occur, and ‘prune’ it to only words which occur at least five times.

words_ls = list(words$word)

it = itoken(words_ls, progressbar = FALSE)

vocab = create_vocabulary(it)

vocab = prune_vocabulary(vocab, term_count_min = 5)

With the vocabulary, construct a ‘term co-occurence matrix’: this is a matrix of rows and columns, counting all the times each word co-occurs with every other word, within a window which can be set with the argument skip_grams_window =. 5 seems to give me good results - I think because many of the documents are so short.

vectorizer = vocab_vectorizer(vocab)

# use window of 10 for context words
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

Now use the GloVe algorithm to train the model and produce the vectors, with a set number of iterations: here we’ve used 20, which seems to give good results. rank here is the number of dimensions we want. x_max is the maximum number of co-occurrences the model will consider in total - giving it a relatively low maximum means that the whole thing won’t be skewed towards a small numbre of words that occur together hundreds of times. rank sets the number of dimensions in the result. The algorithm can be quite slow, but as it’s a relatively small dataset (in comparison to something like the entire English wikipedia), it shouldn’t take too long to run - a couple of minutes for 20 iterations.

glove = GlobalVectors$new(rank = 100, x_max = 100)

wv_main = glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:04:25.341] epoch 1, loss 0.0539 
## INFO  [07:04:38.503] epoch 2, loss 0.0318 
## INFO  [07:04:51.829] epoch 3, loss 0.0261 
## INFO  [07:05:05.224] epoch 4, loss 0.0234 
## INFO  [07:05:18.537] epoch 5, loss 0.0216 
## INFO  [07:05:31.924] epoch 6, loss 0.0204 
## INFO  [07:05:45.319] epoch 7, loss 0.0195 
## INFO  [07:05:58.691] epoch 8, loss 0.0187 
## INFO  [07:06:12.061] epoch 9, loss 0.0181 
## INFO  [07:06:25.444] epoch 10, loss 0.0176 
## INFO  [07:06:39.007] epoch 11, loss 0.0172 
## INFO  [07:06:52.559] epoch 12, loss 0.0168 
## INFO  [07:07:05.968] epoch 13, loss 0.0165 
## INFO  [07:07:19.290] epoch 14, loss 0.0162 
## INFO  [07:07:32.628] epoch 15, loss 0.0159 
## INFO  [07:07:46.021] epoch 16, loss 0.0157 
## INFO  [07:07:59.371] epoch 17, loss 0.0155 
## INFO  [07:08:12.716] epoch 18, loss 0.0153 
## INFO  [07:08:26.073] epoch 19, loss 0.0151 
## INFO  [07:08:39.417] epoch 20, loss 0.0149

GloVe results in two sets of word vectors, the main and the context. The authors of the GloVe package suggest that combining both results in higher-quality embeddings:

wv_context = glove$components



# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
word_vectors = wv_main + t(wv_context)

Reducing Dimensionality for Visualisation

Now that’s done, it’d be nice to visualise the results as a whole. This isn’t actually necessary: as I mentioned earlier, the computer doesn’t care how many dimensions you give it to work out the distances between words. However, in order to visualise the results as whole, we can reduce the 100 dimensions to two or three and plot the results. We can do this with an algorithm called UMAP

There are a number of parameters which can be set - most important is n_components which should be set to two or three so that the results can be plotted.

library(umap)
glove_umap <- umap(word_vectors, n_components = 3, metric = "cosine", n_neighbors = 25, min_dist = 0.01, spread=2)

df_glove_umap <- as.data.frame(glove_umap$layout, stringsAsFactors = FALSE)

# Add the labels of the words to the dataframe
df_glove_umap$word <- rownames(df_glove_umap)
colnames(df_glove_umap) <- c("UMAP1", "UMAP2", "UMAP3", "word")
df_glove_umap$technique <- 'GloVe'
cat(paste0('\n', 'Our GloVe embedding reduced to 2 dimensions:', '\n'))
## 
## Our GloVe embedding reduced to 2 dimensions:
str(df_glove_umap)
## 'data.frame':    22253 obs. of  5 variables:
##  $ UMAP1    : num  -2.758 -0.926 -1.513 -0.749 0.164 ...
##  $ UMAP2    : num  -0.0562 1.2093 1.8558 0.2308 0.6367 ...
##  $ UMAP3    : num  -2.164 0.194 -3.106 2.76 0.53 ...
##  $ word     : chr  "aalst" "aarsele" "abdias" "abernethy" ...
##  $ technique: chr  "GloVe" "GloVe" "GloVe" "GloVe" ...

Next, use Plotly as above to visualise the resulting three dimensions:

plot_ly(data = df_glove_umap, x =  ~UMAP1, y = ~UMAP2, z =  ~UMAP3, text = ~word, alpha = .2, size = .1) %>% add_markers()

Results

When it’s finished, write a small function which calculates and displays the closest words in distance to a given word. Instead of using the euclidean distance formula above, we calculate the cosine similarity, which measures the angular distance between the words (this is better because it corrects for one word appearing many times and another appearing very infrequently).

ten_closest_words = function(word){

word_result = word_vectors[word, , drop = FALSE] 

cos_sim = sim2(x = word_vectors, y = word_result, method = "cosine", norm = "l2")


head(sort(cos_sim[,1], decreasing = TRUE), 30)

}

The function takes a single word as an argument and returns the twenty closest word vectors, by cosine distance. What are the closest in useage to ‘king’?

ten_closest_words('king')
##       king    majesty    england      queen       lord      leave     desire 
##  1.0000000  0.8638741  0.7727203  0.7408834  0.7375371  0.7350961  0.7342174 
##       late       hope understand     prince    promise    command       hear 
##  0.7268651  0.7255197  0.7242622  0.7237931  0.7223181  0.7198124  0.7177404 
##     please     favour       time    receive       duke      grant     return 
##  0.7168768  0.7156008  0.7134387  0.7122003  0.7120294  0.7104037  0.7092559 
##     letter       pray     intend       tell         br      bring     inform 
##  0.7069623  0.7064916  0.7001860  0.6952622  0.6943447  0.6849711  0.6843482 
##     answer    service 
##  0.6826238  0.6813778

Unsurprisingly, a word that is often interchangeable with King, Majesty, is the closest, followed by ‘Queen’ - also obviously interchangeable with King, depending on the circumstances.

Word embeddings are often used to understand different and changing gender representations. How are gendered words represented in the State Papers abstracts? First of all, wife:

ten_closest_words('wife')
##      wife     child   husband    sister  daughter      lady   brother     marry 
## 1.0000000 0.8251752 0.8014678 0.7684266 0.7494467 0.7361869 0.7356229 0.7318627 
##    mother       son    father    family     widow     woman   servant     uncle 
## 0.7223632 0.7043287 0.6813588 0.6689104 0.6684048 0.6494668 0.6306394 0.6206129 
##    friend      live    writer     leave       die      life      dead      poor 
## 0.6180055 0.6039532 0.5888265 0.5827822 0.5773624 0.5757947 0.5746942 0.5744416 
##      pray    estate     niece  countess  remember    health 
## 0.5529604 0.5508137 0.5408341 0.5368639 0.5334508 0.5332996

Unsurprisingly wife is most similar to other words relating to family. What about husband?

ten_closest_words('husband')
##      husband         wife        child        widow       father        woman 
##    1.0000000    0.8014678    0.7618610    0.7226149    0.6372039    0.6152997 
##       mother      servant imprisonment          son     daughter        marry 
##    0.6110061    0.6067118    0.6057760    0.6030784    0.6018447    0.6010796 
##         debt      brother   petitioner       sister         lady     prisoner 
##    0.5966612    0.5907385    0.5898349    0.5787378    0.5779477    0.5770660 
##      release      decease          die       family       access         late 
##    0.5753734    0.5749962    0.5731349    0.5682910    0.5578974    0.5538662 
##        death       estate     petition         life      liberty         dead 
##    0.5522774    0.5430116    0.5421756    0.5248609    0.5228890    0.5221336

Husband is mostly similar but with some interesting different associations: ‘widow’, ‘die’, ‘petition’, ‘debt’, and ‘prisoner’, reflecting the fact that there is a large group of petitions in the State Papers written by women looking for pardons or clemency for their husbands, particularly following the Monmouth Rebellion in 1683.

Looking at the closest words to place names gives some interesting associations. Amsterdam is associated with terms related to shipping and trade:

ten_closest_words('amsterdam')
##   amsterdam   rotterdam        lade    bordeaux    merchant        bind 
##   1.0000000   0.7843269   0.6418232   0.5993143   0.5912159   0.5540331 
##      vessel     holland      french       flush      london     hamburg 
##   0.5517031   0.5415674   0.5298537   0.5281513   0.5192847   0.5168510 
##       dutch    dutchman      richly       prize merchantmen       texel 
##   0.5163422   0.5129344   0.5121920   0.5098451   0.5085725   0.5051106 
##      nantes       sugar        malo        salt         hoy        sail 
##   0.5013724   0.4991486   0.4950068   0.4943924   0.4911585   0.4891844 
##      arrive      bilboa        ship      ostend     english   privateer 
##   0.4885341   0.4851528   0.4845449   0.4842282   0.4824040   0.4792499

Whereas Rome is very much associated with religion and ecclesiastical politics:

ten_closest_words('rome')
##          rome          pope         friar        jesuit         spain 
##     1.0000000     0.6710667     0.5532513     0.5459079     0.5364638 
##         paris        nuncio       courier    ambassador        venice 
##     0.5209065     0.5043323     0.4802523     0.4785569     0.4780924 
##        priest       germany         italy       england        tyrone 
##     0.4729829     0.4671189     0.4633001     0.4626559     0.4480714 
##      cardinal        church        naples advertisement        france 
##     0.4451541     0.4429859     0.4420447     0.4350839     0.4344935 
##          bull        depart      catholic          mass      religion 
##     0.4250987     0.4195093     0.4188933     0.4115858     0.4103830 
##     archdukes        vienna       shortly       emperor      brussels 
##     0.4089576     0.4078097     0.4014825     0.4006026     0.3967409

More Complex Vector Tasks

As well as finding the most similar words, we can also perform arithmetic on the vectors. What is the closest word to book and news:

sum = word_vectors["book", , drop = F] +
 word_vectors["news", , drop = F]

cos_sim_test = sim2(x = word_vectors, y = sum, method = "cosine", norm = "l2")

head(sort(cos_sim_test[,1], decreasing = T), 20)
##       news       book       post    account      write       send     letter 
##  0.8198345  0.8004811  0.6792542  0.6754230  0.6741882  0.6653849  0.6618579 
##       hear      bring williamson    enclose       hope    england      print 
##  0.6519728  0.6433154  0.6388822  0.6314165  0.6198477  0.6180945  0.6152369 
##        day     return       note       hand       week      leave 
##  0.6151938  0.6150748  0.6055430  0.6024733  0.6016956  0.6010381

It is also a way of finding analogies: so, for example, Paris - France + Germany should equal to ‘Berlin’, because Berlin is like the Paris of France. Is that what we get?

test = word_vectors["paris", , drop = F] -
  word_vectors["france", , drop = F] +
  word_vectors["germany", , drop = F]
  
#+
 # shakes_word_vectors["letter", , drop = F]

cos_sim_test = sim2(x = word_vectors, y = test, method = "cosine", norm = "l2")

head(sort(cos_sim_test[,1], decreasing = T), 20)
##         paris       germany           n.s        madrid            ps 
##     0.6679108     0.6246328     0.5039738     0.4432661     0.4409361 
##      ratisbon        bursar     frankfort  koningsmarck           coe 
##     0.4209300     0.4045458     0.3960551     0.3905299     0.3893929 
## advertisement         style         hague      brussels        vienna 
##     0.3889908     0.3813583     0.3787224     0.3697133     0.3639487 
##         shirt    occurrence    remarkable        cypher        ernley 
##     0.3634198     0.3595532     0.3575269     0.3573064     0.3494931

After Germany and Paris, the most similar to Paris - France + Germany is Brussels: not the correct answer, but a close enough guess!

We can try other analogies: pen - letter + book should in theory give some word related to printing and book production such as print, or press, or maybe type (Think pen is to letter as X is to book).

test = word_vectors["pen", , drop = F] -
  word_vectors["letter", , drop = F] +
    word_vectors["book", , drop = F]
  
cos_sim_test = sim2(x = word_vectors, y = test, method = "cosine", norm = "l2")

head(sort(cos_sim_test[,1], decreasing = T), 20)
##        pen        ink       book   pamphlet manuscript    wrangle unlicensed 
##  0.5655214  0.5241191  0.5108703  0.4790678  0.4307206  0.4299565  0.4262404 
##      quire       chop      bible  edgecombe       ream        vii    barlowe 
##  0.4190708  0.4188197  0.4159428  0.4116286  0.4105579  0.4077308  0.4049325 
##      cloak     bundle     schism     liquor       fool     dredge 
##  0.3986532  0.3899056  0.3884587  0.3878514  0.3875513  0.3874864

Not bad - printer is in the top 20! The closest is ink, plus some other book-production-related words like pamphlet. Though some of these words can also be associated with manuscript production, we could be generous and say that they are sort of to a book as a pen is to a letter!

Change in Semantic Relations Over Time

We can also look for change in semantic meaning over time. First, divide the text into four separate sections, one for each reign:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
james_i = withtext %>% 
  mutate(year = year(ymd(X4))) %>% 
  filter(year %in% 1603:1624) %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% 
  anti_join(stop_words) %>% 
  mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ  24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
charles_i = withtext %>% 
  mutate(year = year(ymd(X4))) %>% 
  filter(year %in% 1625:1648) %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% anti_join(stop_words)%>% 
  mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ  24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.

## Warning:  24 failed to parse.
## Joining, by = "word"
commonwealth = withtext %>% 
  mutate(year = year(ymd(X4))) %>% 
  filter(year %in% 1649:1659) %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% anti_join(stop_words)%>% 
  mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ  24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.

## Warning:  24 failed to parse.
## Joining, by = "word"
charles_ii = withtext %>% 
  mutate(year = year(ymd(X4))) %>% 
  filter(year %in% 1660:1684) %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% anti_join(stop_words)%>% 
  mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ  24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.

## Warning:  24 failed to parse.
## Joining, by = "word"
james_ii_w_m_ann = withtext %>% 
  mutate(year = year(ymd(X4))) %>% 
  filter(year %in% 1685:1714) %>% 
  ungroup()  %>% 
  select(document = X5, value, date = X3) %>%
  unnest_tokens(word, value) %>% anti_join(stop_words) %>% 
  mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ  24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.

## Warning:  24 failed to parse.
## Joining, by = "word"

Now run the same scripts as above, on each of these sections:

james_i_words_ls = list(james_i$word)
it = itoken(james_i_words_ls, progressbar = FALSE)
james_i_vocab = create_vocabulary(it)
james_i_vocab = prune_vocabulary(james_i_vocab, term_count_min = 5)

vectorizer = vocab_vectorizer(james_i_vocab)

# use window of 10 for context words
james_i_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

james_i_glove = GlobalVectors$new(rank = 100, x_max = 100)

james_i_wv_main = james_i_glove$fit_transform(james_i_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:13:51.511] epoch 1, loss 0.0491 
## INFO  [07:13:54.733] epoch 2, loss 0.0291 
## INFO  [07:13:57.951] epoch 3, loss 0.0231 
## INFO  [07:14:01.154] epoch 4, loss 0.0203 
## INFO  [07:14:04.353] epoch 5, loss 0.0185 
## INFO  [07:14:07.563] epoch 6, loss 0.0172 
## INFO  [07:14:10.778] epoch 7, loss 0.0162 
## INFO  [07:14:13.984] epoch 8, loss 0.0154 
## INFO  [07:14:17.178] epoch 9, loss 0.0147 
## INFO  [07:14:20.412] epoch 10, loss 0.0142 
## INFO  [07:14:23.623] epoch 11, loss 0.0137 
## INFO  [07:14:26.837] epoch 12, loss 0.0133 
## INFO  [07:14:30.065] epoch 13, loss 0.0129 
## INFO  [07:14:33.294] epoch 14, loss 0.0126 
## INFO  [07:14:36.509] epoch 15, loss 0.0123 
## INFO  [07:14:39.702] epoch 16, loss 0.0120 
## INFO  [07:14:42.909] epoch 17, loss 0.0118 
## INFO  [07:14:46.120] epoch 18, loss 0.0115 
## INFO  [07:14:49.328] epoch 19, loss 0.0113 
## INFO  [07:14:52.560] epoch 20, loss 0.0111
james_i_wv_context = james_i_glove$components

james_i_word_vectors = james_i_wv_main + t(james_i_wv_context)
charles_i_words_ls = list(charles_i$word)
it = itoken(charles_i_words_ls, progressbar = FALSE)
charles_i_vocab = create_vocabulary(it)
charles_i_vocab = prune_vocabulary(charles_i_vocab, term_count_min = 5)

vectorizer = vocab_vectorizer(charles_i_vocab)

# use window of 10 for context words
charles_i_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

charles_i_glove = GlobalVectors$new(rank = 100, x_max = 100)

charles_i_wv_main = charles_i_glove$fit_transform(charles_i_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:15:01.562] epoch 1, loss 0.0537 
## INFO  [07:15:05.691] epoch 2, loss 0.0298 
## INFO  [07:15:09.677] epoch 3, loss 0.0234 
## INFO  [07:15:13.700] epoch 4, loss 0.0207 
## INFO  [07:15:17.690] epoch 5, loss 0.0189 
## INFO  [07:15:21.675] epoch 6, loss 0.0176 
## INFO  [07:15:25.658] epoch 7, loss 0.0166 
## INFO  [07:15:29.670] epoch 8, loss 0.0158 
## INFO  [07:15:33.694] epoch 9, loss 0.0152 
## INFO  [07:15:37.730] epoch 10, loss 0.0146 
## INFO  [07:15:41.763] epoch 11, loss 0.0141 
## INFO  [07:15:45.776] epoch 12, loss 0.0137 
## INFO  [07:15:49.800] epoch 13, loss 0.0134 
## INFO  [07:15:53.845] epoch 14, loss 0.0130 
## INFO  [07:15:57.842] epoch 15, loss 0.0127 
## INFO  [07:16:01.910] epoch 16, loss 0.0125 
## INFO  [07:16:05.911] epoch 17, loss 0.0122 
## INFO  [07:16:09.882] epoch 18, loss 0.0120 
## INFO  [07:16:13.854] epoch 19, loss 0.0118 
## INFO  [07:16:17.817] epoch 20, loss 0.0116
charles_i_wv_context = charles_i_glove$components

charles_i_word_vectors = charles_i_wv_main + t(charles_i_wv_context)
commonwealth_words_ls = list(commonwealth$word)
it = itoken(commonwealth_words_ls, progressbar = FALSE)
commonwealth_vocab = create_vocabulary(it)
commonwealth_vocab = prune_vocabulary(commonwealth_vocab, term_count_min = 5)

vectorizer = vocab_vectorizer(commonwealth_vocab)

# use window of 10 for context words
commonwealth_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

commonwealth_glove = GlobalVectors$new(rank = 100, x_max = 100)

commonwealth_wv_main = commonwealth_glove$fit_transform(commonwealth_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:16:21.651] epoch 1, loss 0.0547 
## INFO  [07:16:23.346] epoch 2, loss 0.0315 
## INFO  [07:16:25.043] epoch 3, loss 0.0248 
## INFO  [07:16:26.747] epoch 4, loss 0.0215 
## INFO  [07:16:28.480] epoch 5, loss 0.0193 
## INFO  [07:16:30.188] epoch 6, loss 0.0178 
## INFO  [07:16:31.882] epoch 7, loss 0.0167 
## INFO  [07:16:33.589] epoch 8, loss 0.0157 
## INFO  [07:16:35.297] epoch 9, loss 0.0150 
## INFO  [07:16:37.015] epoch 10, loss 0.0143 
## INFO  [07:16:38.720] epoch 11, loss 0.0138 
## INFO  [07:16:40.415] epoch 12, loss 0.0133 
## INFO  [07:16:42.126] epoch 13, loss 0.0129 
## INFO  [07:16:43.818] epoch 14, loss 0.0125 
## INFO  [07:16:45.518] epoch 15, loss 0.0121 
## INFO  [07:16:47.224] epoch 16, loss 0.0118 
## INFO  [07:16:48.938] epoch 17, loss 0.0115 
## INFO  [07:16:50.671] epoch 18, loss 0.0113 
## INFO  [07:16:52.384] epoch 19, loss 0.0110 
## INFO  [07:16:54.108] epoch 20, loss 0.0108
commonwealth_wv_context = commonwealth_glove$components

# dim(shakes_wv_context)

# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
commonwealth_word_vectors = commonwealth_wv_main + t(commonwealth_wv_context)
charles_ii_words_ls = list(charles_ii$word)
it = itoken(charles_ii_words_ls, progressbar = FALSE)
charles_ii_vocab = create_vocabulary(it)
charles_ii_vocab = prune_vocabulary(charles_ii_vocab, term_count_min = 5)

vectorizer = vocab_vectorizer(charles_ii_vocab)

# use window of 10 for context words
charles_ii_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

charles_ii_glove = GlobalVectors$new(rank = 100, x_max = 100)

charles_ii_wv_main = charles_ii_glove$fit_transform(charles_ii_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:17:07.576] epoch 1, loss 0.0503 
## INFO  [07:17:13.682] epoch 2, loss 0.0286 
## INFO  [07:17:19.731] epoch 3, loss 0.0231 
## INFO  [07:17:25.800] epoch 4, loss 0.0205 
## INFO  [07:17:31.863] epoch 5, loss 0.0188 
## INFO  [07:17:38.012] epoch 6, loss 0.0176 
## INFO  [07:17:44.127] epoch 7, loss 0.0167 
## INFO  [07:17:50.195] epoch 8, loss 0.0159 
## INFO  [07:17:56.319] epoch 9, loss 0.0153 
## INFO  [07:18:03.664] epoch 10, loss 0.0148 
## INFO  [07:18:15.506] epoch 11, loss 0.0144 
## INFO  [07:18:27.367] epoch 12, loss 0.0140 
## INFO  [07:18:36.653] epoch 13, loss 0.0137 
## INFO  [07:18:42.707] epoch 14, loss 0.0134 
## INFO  [07:18:48.747] epoch 15, loss 0.0131 
## INFO  [07:18:54.798] epoch 16, loss 0.0129 
## INFO  [07:19:04.634] epoch 17, loss 0.0127 
## INFO  [07:19:16.181] epoch 18, loss 0.0125 
## INFO  [07:19:26.842] epoch 19, loss 0.0123 
## INFO  [07:19:36.290] epoch 20, loss 0.0121
charles_ii_wv_context = charles_ii_glove$components

# dim(shakes_wv_context)

# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
charles_ii_word_vectors = charles_ii_wv_main + t(charles_ii_wv_context)
james_ii_w_m_ann_words_ls = list(james_ii_w_m_ann$word)
it = itoken(james_ii_w_m_ann_words_ls, progressbar = FALSE)
james_ii_w_m_ann_vocab = create_vocabulary(it)
james_ii_w_m_ann_vocab = prune_vocabulary(james_ii_w_m_ann_vocab, term_count_min = 5)

vectorizer = vocab_vectorizer(james_ii_w_m_ann_vocab)

# use window of 10 for context words
james_ii_w_m_ann_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)

james_ii_w_m_ann_glove = GlobalVectors$new(rank = 100, x_max = 100)

james_ii_w_m_ann_wv_main = james_ii_w_m_ann_glove$fit_transform(james_ii_w_m_ann_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO  [07:19:41.941] epoch 1, loss 0.0512 
## INFO  [07:19:44.506] epoch 2, loss 0.0294 
## INFO  [07:19:47.146] epoch 3, loss 0.0235 
## INFO  [07:19:49.760] epoch 4, loss 0.0206 
## INFO  [07:19:52.385] epoch 5, loss 0.0187 
## INFO  [07:19:55.017] epoch 6, loss 0.0174 
## INFO  [07:19:57.628] epoch 7, loss 0.0163 
## INFO  [07:20:00.242] epoch 8, loss 0.0155 
## INFO  [07:20:02.864] epoch 9, loss 0.0148 
## INFO  [07:20:07.515] epoch 10, loss 0.0142 
## INFO  [07:20:12.153] epoch 11, loss 0.0137 
## INFO  [07:20:16.802] epoch 12, loss 0.0133 
## INFO  [07:20:21.473] epoch 13, loss 0.0129 
## INFO  [07:20:26.118] epoch 14, loss 0.0126 
## INFO  [07:20:30.826] epoch 15, loss 0.0122 
## INFO  [07:20:35.493] epoch 16, loss 0.0120 
## INFO  [07:20:40.163] epoch 17, loss 0.0117 
## INFO  [07:20:44.831] epoch 18, loss 0.0115 
## INFO  [07:20:49.503] epoch 19, loss 0.0113 
## INFO  [07:20:54.145] epoch 20, loss 0.0111
james_ii_w_m_ann_wv_context = james_ii_w_m_ann_glove$components

# dim(shakes_wv_context)

# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
james_ii_w_m_ann_word_vectors = james_ii_w_m_ann_wv_main + t(james_ii_w_m_ann_wv_context)

Write a function as above, this time with two arguments, so we can specify both the word and the relevant reign:

top_ten_function = function(word, period){
  
  
  if(period == 'james_i'){
    
  vectors = james_i_word_vectors[word, , drop = FALSE] 
  cos_sim = sim2(x = james_i_word_vectors, y = vectors, method = "cosine", norm = "l2")


}
  else if(period == 'charles_i'){  vectors = charles_i_word_vectors[word, , drop = FALSE] 
  cos_sim = sim2(x = charles_i_word_vectors, y = vectors, method = "cosine", norm = "l2")
  
  } 
  else if(period == 'commonwealth')  { 
    
    vectors = commonwealth_word_vectors[word, , drop = FALSE] 
  cos_sim = sim2(x = commonwealth_word_vectors, y = vectors, method = "cosine", norm = "l2")
  
  }
  
  else if(period == 'charles_ii'){
    
    vectors = charles_ii_word_vectors[word, , drop = FALSE] 
  cos_sim = sim2(x = charles_ii_word_vectors, y = vectors, method = "cosine", norm = "l2")
  
  }
  
  else {
    
  vectors = james_ii_w_m_ann_word_vectors[word, , drop = FALSE] 
  cos_sim = sim2(x = james_ii_w_m_ann_word_vectors, y = vectors, method = "cosine", norm = "l2")
  }
  
head(sort(cos_sim[,1], decreasing = TRUE), 20)


}

Write a second function, which takes a word and returns the ten closest words for each reign:

first_in_each= function(word) {
  
  rbind(top_ten_function(word, 'james_i') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='james_i' ),
     top_ten_function(word, 'charles_i') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='charles_i' ),
     top_ten_function(word, 'commonwealth') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='commonwealth' ),
     top_ten_function(word, 'charles_ii') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='charles_ii' ),
     top_ten_function(word, 'james_ii_w_m_ann') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='james_ii_w_m_ann' ))%>% 
  group_by(reign) %>% 
  mutate(rank = rank(value)) %>% 
  ggplot() + 
  geom_text(aes(x = factor(reign, levels = c('james_i', 'charles_i', 'commonwealth', 'charles_ii', 'james_ii', 'james_ii_w_m_ann')), y = rank, label = name, color = name)) + theme_void() +
  theme(legend.position = 'none', 
        axis.text.x = element_text(face = 'bold'), 
        ) 
  
  }

This can show us the changing associations of particular words over time. Take ‘match’:

first_in_each('match') 

In the reign of James I, ‘match’ is semantically linked to words relating to the Spanish Match: a proposed match between Charles I and the Infanta Maria Anna of Spain. During Charles I’s reign and afterwards, the meaning changes completely - now the closest words are all military. In the final section of the data, the semantic link returns again to mostly words about marriage - this time it’s not so obvious why the words are associated, but it’s probably relating to the marriage of Philippe II, Duke of Orléans to Françoise Marie de Bourbon, in 1692 - Philippe II was regent of France until 1723.

Conclusions

The primary purpose of this technique in the ‘real world’ isn’t really to understand the semantic relationship between words for its own sake, but rather is most often used as part of an NLP pipeline, where the embeddings are fed through a neural net to make predictions about text.

However, the word embeddings trained on the text of the Calendars is still a useful way to think about how these texts are constructed and the sort of ‘mental map’ they represent. We’ve seen that it often produces expected results (such as King being closest to Majesty), even in complex tasks: with the analogy pen is to letter as X is to book, X is replaced by ink, printer, pamphlet, and some other relevant book-production words. Certain words can be seen to change over time: match is a good example, which is linked to marriage at some times, and weaponry at others, depending on the time period. Many of these word associations reflect biases in the data, but in certain circumstances this can be a strength rather than a weakness. The danger is not investigating the biases, but rather when we are reductive and try to claim that the word associations seen here are in any way representative of how society at large thought about these concepts more generally. On their own terms, the embeddings can be a powerful historical tool to understand the linked meanings within a discrete set of sources.